Sales prediction model for State of Connecticut Real Estate Sales 2001-2022 GL¶
The Office of Policy and Management maintains a listing of all real estate sales with a sales price of $2,000 or greater that occur between October 1 and September 30 of each year. For each sale record, the file includes: town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment.
Source Data : https://catalog.data.gov/dataset/real-estate-sales-2001-2018
Return Home : https://johnkimaiyo.vercel.app/
Creating a prediction model using Python and Pandas involves several steps, including data preprocessing, exploratory data analysis, feature engineering, model selection, training, and evaluation.
Step 1: Import Necessary Libraries¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import joblib
Step 2: Load the Dataset¶
Real_Estate_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv")
# Display the first few rows of the dataset
print(Real_Estate_Sales_df.head())
C:\Users\jki\AppData\Local\Temp\ipykernel_20024\2726286131.py:1: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False. Real_Estate_Sales_df = pd.read_csv(r"C:\Users\jki\Desktop\Data Scence Projects\Sate of Connecticut Real Etstae Sales\Machine Learning\Source Data\Real_Estate_Sales_2001-2022_GL.csv")
Serial Number List Year Date Recorded Town Address \ 0 220008 2022 01/30/2023 Andover 618 ROUTE 6 1 2020348 2020 09/13/2021 Ansonia 230 WAKELEE AVE 2 20002 2020 10/02/2020 Ashford 390 TURNPIKE RD 3 210317 2021 07/05/2022 Avon 53 COTSWOLD WAY 4 200212 2020 03/09/2021 Avon 5 CHESTNUT DRIVE Assessed Value Sale Amount Sales Ratio Property Type Residential Type \ 0 139020.0 232000.0 0.5992 Residential Single Family 1 150500.0 325000.0 0.4630 Commercial NaN 2 253000.0 430000.0 0.5883 Residential Single Family 3 329730.0 805000.0 0.4096 Residential Single Family 4 130400.0 179900.0 0.7248 Residential Condo Non Use Code Assessor Remarks OPM remarks \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN Location 0 POINT (-72.343628962 41.728431984) 1 NaN 2 NaN 3 POINT (-72.846365959 41.781677018) 4 NaN
Step 3: Data Preprocessing¶
Before building the model, you need to preprocess the data. This includes handling missing values, converting data types, and encoding categorical variables if necessary.
# lets check for missing values
missing_values = Real_Estate_Sales_df.isna().sum()
print(missing_values)
Serial Number 0 List Year 0 Date Recorded 2 Town 0 Address 51 Assessed Value 0 Sale Amount 0 Sales Ratio 0 Property Type 382446 Residential Type 398389 Non Use Code 784178 Assessor Remarks 926401 OPM remarks 1084598 Location 799518 dtype: int64
# lets remove missing values
Real_Estate_Sales_df.dropna(subset=['Date Recorded','Address','Property Type','Residential Type','Non Use Code','Assessor Remarks','OPM remarks','Location'],inplace =True)
# lets confirm existance of missing values
missing_values = Real_Estate_Sales_df.isna().sum()
print(missing_values)
Serial Number 0 List Year 0 Date Recorded 0 Town 0 Address 0 Assessed Value 0 Sale Amount 0 Sales Ratio 0 Property Type 0 Residential Type 0 Non Use Code 0 Assessor Remarks 0 OPM remarks 0 Location 0 dtype: int64
# let check if we have unwanted negative values
Real_Estate_Sales_df.describe()
Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio | |
---|---|---|---|---|---|
count | 1.097629e+06 | 1.097629e+06 | 1.097629e+06 | 1.097629e+06 | 1.097629e+06 |
mean | 5.370357e+05 | 2.011218e+03 | 2.818016e+05 | 4.053146e+05 | 9.603926e+00 |
std | 7.526074e+06 | 6.773485e+00 | 1.657890e+06 | 5.143492e+06 | 1.801664e+03 |
min | 0.000000e+00 | 2.001000e+03 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
25% | 3.071300e+04 | 2.005000e+03 | 8.909000e+04 | 1.450000e+05 | 4.778667e-01 |
50% | 8.070600e+04 | 2.011000e+03 | 1.405800e+05 | 2.330000e+05 | 6.105663e-01 |
75% | 1.703410e+05 | 2.018000e+03 | 2.282700e+05 | 3.750000e+05 | 7.707200e-01 |
max | 2.000500e+09 | 2.022000e+03 | 8.815100e+08 | 5.000000e+09 | 1.226420e+06 |
# lets check for data types to perform numerical calculations
Real_Estate_Sales_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 560 entries, 527 to 1093559 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Serial Number 560 non-null int64 1 List Year 560 non-null int64 2 Town 560 non-null object 3 Address 560 non-null object 4 Assessed Value 560 non-null float64 5 Sale Amount 560 non-null float64 6 Sales Ratio 560 non-null float64 7 Year 560 non-null int32 8 Month 560 non-null int32 9 Day 560 non-null int32 dtypes: float64(3), int32(3), int64(2), object(2) memory usage: 41.6+ KB
print(Real_Estate_Sales_df.columns)
Index(['Serial Number', 'List Year', 'Town', 'Address', 'Assessed Value', 'Sale Amount', 'Sales Ratio', 'Year', 'Month', 'Day'], dtype='object')
# lets drop unwanted columns
# Columns to drop
columns_to_drop = ['Property Type', 'Residential Type','Non Use Code','Assessor Remarks','OPM remarks','Location','Address','Town','Serial Number','List Year']
Real_Estate_Sales_df.drop(columns=columns_to_drop, inplace=True,errors='ignore')
Real_Estate_Sales_df.head(5)
Assessed Value | Sale Amount | Sales Ratio | Year | Month | Day | |
---|---|---|---|---|---|---|
527 | 207700.0 | 400000.0 | 0.5193 | 2023 | 9 | 20 |
922 | 49700.0 | 29900.0 | 1.6622 | 2023 | 2 | 27 |
940 | 208230.0 | 423261.0 | 0.4920 | 2022 | 10 | 31 |
1345 | 169050.0 | 351000.0 | 0.4816 | 2023 | 3 | 30 |
1632 | 769900.0 | 3500000.0 | 0.2200 | 2023 | 4 | 26 |
Step 4: Exploratory Data Analysis (EDA)¶
# Summary statistics
print(Real_Estate_Sales_df.describe())
# Correlation matrix
print(Real_Estate_Sales_df.corr())
# Plotting the correlation matrix
import seaborn as sns
sns.heatmap(Real_Estate_Sales_df.corr(), annot=True, cmap='coolwarm')
plt.show()
Assessed Value Sale Amount Sales Ratio Year Month \ count 5.600000e+02 5.600000e+02 560.000000 560.000000 560.000000 mean 1.701261e+05 9.091623e+05 0.838064 2020.926786 6.951786 std 1.669984e+05 1.346897e+07 1.091531 1.850397 3.215486 min 0.000000e+00 4.500000e+03 0.000000 2017.000000 1.000000 25% 9.025750e+04 1.100000e+05 0.413374 2020.000000 4.000000 50% 1.416000e+05 2.140000e+05 0.625957 2021.000000 7.000000 75% 1.992075e+05 3.752500e+05 1.020047 2022.000000 10.000000 max 2.083410e+06 3.187900e+08 20.020000 2023.000000 12.000000 Day count 560.000000 mean 15.817857 std 9.149467 min 1.000000 25% 8.000000 50% 17.000000 75% 23.000000 max 31.000000 Assessed Value Sale Amount Sales Ratio Year Month \ Assessed Value 1.000000 0.048048 -0.046309 0.017759 -0.014153 Sale Amount 0.048048 1.000000 -0.040326 0.004436 -0.000965 Sales Ratio -0.046309 -0.040326 1.000000 -0.008175 0.013997 Year 0.017759 0.004436 -0.008175 1.000000 -0.140702 Month -0.014153 -0.000965 0.013997 -0.140702 1.000000 Day -0.026584 -0.045498 0.056925 0.012419 0.000674 Day Assessed Value -0.026584 Sale Amount -0.045498 Sales Ratio 0.056925 Year 0.012419 Month 0.000674 Day 1.000000
Step 5: Feature Engineering¶
Feature engineering involves creating new features or transforming existing ones to improve the model's performance.
# Display the first few rows after feature engineering
print(Real_Estate_Sales_df.head())
Assessed Value Sale Amount Sales Ratio Year Month Day 527 207700.0 400000.0 0.5193 2023 9 20 922 49700.0 29900.0 1.6622 2023 2 27 940 208230.0 423261.0 0.4920 2022 10 31 1345 169050.0 351000.0 0.4816 2023 3 30 1632 769900.0 3500000.0 0.2200 2023 4 26
Step 6: Splitting the Data¶
Split the data into training and testing sets
# Define features (X) and target (y)
X = Real_Estate_Sales_df.drop(['Sale Amount'], axis=1)
y = Real_Estate_Sales_df['Sale Amount']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
(448, 5) (112, 5)
Step 7: Model Selection and Training¶
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
Step 8: Model Evaluation¶
Evaluate the model's performance on the test data.
# Make predictions
y_pred = model.predict(X_test)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# Plot the actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
Mean Squared Error: 1358146224171.0066
Step 9: Making Predictions¶
You can now use the trained model to make predictions on new data.
# Example: Predict on new data
new_data = pd.DataFrame({
'Assessed Value': [20770],
'Sales Ratio': [0.5],
'Year': [2026],
'Month': [5],
'Day': [15],
})
# Save the model to a file
joblib.dump(model, 'real_estate_sales_model.pkl')
predicted_sales = model.predict(new_data)
print(f'Predicted Total Sales: {predicted_sales[0]}')
Predicted Total Sales: 872499.7062873021